Graphical EDA I: continous data
Team 1
09 November 2020
Structure
1. Examining Continuous Variables
1.1. Introduction
1.2. What features might continuous variable have?
1.3. Looking for features
1.4. Comparing distributions by subgroups
1.5. What plots are there for individual continuous variables?
1.6. Plot options
1.7. Modelling and testing for continuous variables
1.8. Main points of chapter
2. Looking for Structure: Dependency Relationships and Associations
2.1. Introduction
2.2. Features you can see with scatterplots
2.3. Looking at pairs of continuous variables
2.4. Adding models: lines and smooths
2.5. Comparing groups within scatterplots
2.6. Scatterplot matrices for looking at many pairs of variables
2.7. Scatterplot options
2.8. Modelling and testing for relationships between variables
2.9. Main points
3. Investigating Multivariate Continuous Data
3.1. Introduction
3.2. What is a parallel coordinate plot (pcp)?
3.3. Features you can see with parallel coordinate plots
3.4. Parallel coordinate plots and time series
3.5. Parallel coordinate plots for indices
3.6. Options for parallel coordinate plots
3.7. Main points
1. Examining Continuous Variables
1.1. Introduction
A continuous variable in theory can take any value over its range. A handy exemple of countinuous variable can be:
- age: you can’t count “age” because it would literally take “forever”. you could be: 25 years, 10 months, 2 days, 5 hours, 4 seconds, 4 milliseconds, 8 nanoseconds and so on..
- time: it will also take “forever” to count the time; how many nanoseconds since we start reading this line?
- In practice data for continuous variables are generally rounded to some level of measurement accuracy.
Different plotting methods are used to display the data distribution for continuous variables.
The purpose of those methods is to highlight important features of data. Some plots emphasise one feature over another, some are very specialised, some require more deciphering than others.
Two possible approaches are to use a range of different plot types or to use a variety of different plots of the same type.
The plots types we are going to use in this chapter are histograms and boxplots.
1.2. What features might continuous variable have?
Let’s see a list of features for continuous variables:
- Assymetry: the distribution is skewed to the left or right;
- Outliers: one or more values that are far from the rest of the data.
- Multimodality: the distribution has more than one peak;
- Gaps: there are ranges of values within the data where no cases are recorded;
- Heaping: some values occur unexpectedly often;
- Rounding: only certain values are found because of the rounding process;
- Impossibilities: values outside the feasible range; you can’t be -1 year old;
- Errors: values that look wrong for one reason or another
Graphics are good for displaying the features**
They can provide more and different kinds of information than a set of summary statistics.
1.3. Looking for features
Next we are going to analyse a data set using histograms in order to see which features might be present and how we can find them.
The Galton’s Height Data is a famous 1885 study of Francis Galton exploring the relationship between the heights of adult children and the heights of their parents.
The galton dataset from UsingR package includes data on heights for 928 children and 205 ‘midparents’. Using the code below we are creating Children histogram and Parents histogram.
data(galton, package="UsingR")
ht <- "height (in)"
par(mfrow=c(1,2), las=1, mar=c(3.1, 4.1, 1.1, 2.1))
with(galton, {
hist(child, xlab=ht, main="Children", col="green")
hist(parent, xlab=ht, main="Parents", col="blue")})
#### First things first, we can see that: - the distributions are roughly symmetric and there is more spread amongst the children; - there appear to be no outliers; - although the histograms appear to have different binwidths they are actually the same;
To look for gaps or heaping we create next histograms:
par(mfrow=c(1,2), las=1, mar=c(3.1, 4.1, 1.1, 2.1))
with(galton, {
MASS::truehist(child, h=0.1)
MASS::truehist(parent, h=0.1)})
In both histograms there appear to be narrower gaps between values at the boundaries. Clearly the data were aggregated or reported to a limited level of precision because we can see how few height values actually occur.
Next figure displays the histograms one above the other with equal scales and binwidths. The x-axis scale limits were chosen to include the full range of the data and the y-axis limits were chosen by inspection.
library(ggplot2)
library(gridExtra)
c1 <- ggplot(galton, aes(child)) + geom_bar( binwidth=1) +
xlim(60, 75) + ylim(0, 225) + ylab("") +
geom_vline(xintercept=median(galton$child),
col="red")
p1 <- ggplot(galton, aes(parent)) + geom_bar( binwidth=1) +
xlim(60, 75) + ylim(0, 225) + ylab("") +
geom_vline(xintercept=median(galton$parent),
col="red")
grid.arrange(c1, p1)
We can see that Parents’ heights are spread less and their median is slightly higher. Although the means are very similar, the parents’ distribution is clearly less variable. This is because each parent value is an average of two values.
1.4. Comparing distributions by subgroups
In order to approach this problem, we can use Boxplots because they make such an efficient use of the space available.
Boxplots:
The boxplot() function takes in any number of numeric vectors, drawing a boxplot for each vector. We will use airquality dataset in order to see the air quality for each month; In the example, Temp can be our numeric vector. Month can be our grouping variable, so that we get the boxplot for each month separately. In our dataset, month is in the form of number 5=May, 6=June and so on).
boxplot(Temp~Month,
data=airquality,
main="Different boxplots for each month",
xlab="Month Number",
ylab="Degree Fahrenheit",
col="orange",
border="brown"
)
It is clear from the above figure that the July month (7) is relatively hotter than the rest. Although the boxplots convey a lot of information in a limited space, they have two disadvantages: If a distribution is not unimodal, you can’t see that, and it is difficult to convey information on how big the different groups are.
1.5. What plots are there for individual continuous variables?
To display continuous data graphically we can use:
- histogram is used for grouping data into intervals, and drawing a bar for each interval, shows the empirical distribution,
- boxplot for displaying individual outliers and robust statistics for the data, useful for identifying outliers and for comparisons of distributions across subgroups,
- dotplot plotting each point individually as a dot, good for spotting gaps in the data,
- rugplot plotting each point individually as a line, often used as an additional plot along the horizontal axis of another display,
- density estimate plotting an estimated density of the variable’s distribution, so more like a model than a data display,
- distribution estimate showing the estimated distribution function, useful for comparing distributions, if one is always ‘ahead’ of another,
- Q-Q plot comparing the distribution to a theoretical one, normal distribution.
1.6. Plot options
What are the options for plottings?
- Binwidths: The width of the bins. Can be specified as a numeric value or as a function that calculates width from unscaled x. Exploring multiple widths, we can find the best to ilustrate the stories in our data. This option is crucial for plotting histograms.
- Unequal binwidths: While the idea of using unequal binwidths is attractive, it is awkward to apply in practice because the it displays confusing and hard to interpret data. If you still want to do it, consider a variable transformation instead.
- Banwidth for density estimates: The bandwidth selector is a function of four arguments: The data x, a kernel string kernel, a start string start, and a support vector support. To obtain the functions associated with these strings, use get_kernel and get_start. The function should return a double. Crucial while displaying density estimates.
1.7. Modelling and testing for continuous variables
Let’s see the way of modelling and testing for continuous variable:
- Means is the most common test for continuous data is to test the mean in some way, either against a standard value, or in comparison to the means of other variables, or by subsets. Mostly a t-test is used. Alternatively medians may be tested, especially in conjunction with using boxplots.
- Symmetry. Its role is to improve the power of the test, using bootstrapping.
- Normality tests like Anderson-Darling, Shapiro-Wilk, Kolmogorov-Smirnov have low power for small samples and may be rather too powerful for really large samples. A large sample will tend to have some feature that will lead to rejection of the null hypothesis.
- Density estimation in R can be applied using one of the many R packages wich offer some for of density estimation. This model is fitting the density estimates without testing. Remember that densities for variables with strict boundaries need special treatment at the boundaries.
- Outliers How useful a outlier test might be depends on the particular application. Testing for outliers we need to prevent both masking: one group of outliers prevents you recognising another and swamping: mistaking standard observations for outliers.
- Multimodality test can be approach by the dip test. It was proposed in Hartigan and Hartigan, 1985 and it is available in R in the appropriately named package diptest.
1.8. Main points of chapter
- There are lots of different features that can arise in the frequency distributions of single continuous variables.
- There is no optimal type of plot and no optimal version of a plot type. Look at several different plots and several different versions of each.
- Natural binwidths based on context are usually a good choice for histograms.
- Histograms are good for emphasising features of the raw data, while density estimates are better for suggesting underlying models for the data.
- Boxplots are best for identifying outliers and for comparing distributions across subgroups.
2. Looking for Structure: Dependency Relationships and Associations
2.1. Introduction
Roles of scatterplots are:
- revealing associations between variables;
- identifying outliers;
- spotting distributional features.
Over ten thousand athletes competed in the London Summer Olympics of 2012. The figure below shows a scatterplot of Weight against Height from the dataset oly12 in the package VGAMdata.
library(ggplot2);
data(oly12, package="VGAMdata")
ggplot(oly12, aes(Height, Weight)) + geom_point() +
ggtitle("Athletes at the London Olympics 2012")

2.2. Features you can see with scatterplots
Features that might be found in a scatterplot include:
- casual relationships (linear and nonlinear)
- associations
- outliers or groups of outliers
- clusters
- gaps (particular combinations of values that do not occur)
- barriers
- conditional relationship
2.3. Looking at pairs of continuous variables
Drinks Wages
A scatterplot of average wages against proportion of drinkers for all 70 trades in Pearson’s DrinksWages dataset in the HistData package shows that there are a surprising number of trades where either all workers are drinkers or none are.
data(DrinksWages, package = "HistData")
ggplot(DrinksWages, aes(drinks / n, wage)) + geom_point() +
xlab("Proportion of drinkers") + xlim(0, 1) + ylim(0, 40)

Observations
over a third have only 1 or 2 members in the survey, which affects the graphic;
after excluding the smaller trades, with less than five members, there is still no particular pattern.
2.3. Looking at pairs of continuous variables (cont.)
Old faithful geyser
The dataset geyser in MASS provides 299 observations of the duration of eruptions and the time to the next eruption for the Old Faithful geyser in Yellowstone Park.
library(ggplot2);
data(geyser, package="MASS");
ggplot(geyser, aes(duration, waiting)) + geom_point();
To assess the possibility of clustering, consider a density estimate. The figure below displays contours of a bivariate density estimate supporting the idea of there being three clusters.
ggplot(geyser, aes(duration, waiting)) + geom_point() +
geom_density2d()

2.4. Adding models: lines and smooths
Pearson heights
Adding lines or smooths to scatterplots is easy and often provides valuable guidance, as can be seen comparing the figures below, where the relationship between sons’ and fathers’ heights, from dataset father.son, is difficult to assess from the scatterplot alone.
Scatterplot with best regression line and y=x diagonal
data(father.son, package="UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() +
geom_smooth(method="lm", colour="red") +
geom_abline(slope=1, intercept=0)
## `geom_smooth()` using formula 'y ~ x'

Plotting a smoother and the best regression line together
data(father.son, package="UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() +
geom_smooth(method="lm", colour="red", se=FALSE) +
stat_smooth()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

2.5. Comparing groups within scatterplots
The Olympic athletes’ dataset shown in Chapter 2.1 suffered from overplotting, which can be overcome by splitting up the data by possible explanatory variables. In this case, we can separate the plot by sex: a scatterplot for females and one for males.
ggplot(oly12, aes(Height, Weight)) +
geom_point(size = 1) + facet_wrap(~Sex, ncol=1)

Plotting all the 42 sports from the dataset together we get the following figure:
oly12S <- within(oly12, Sport <- abbreviate(Sport, 12))
ggplot(oly12S, aes(Height, Weight)) +
geom_point(size = 1) + facet_wrap( ~ Sport) +
ggtitle("Weight and Height by Sport")

2.6. Scatterplot matrices for looking at many pairs of variables
Scatterplot matrices (sploms) are tables of scatterplots with each variable plotted against all of the others. They give excellent initial overviews of the relationships between continuous variables in datasets with small numbers of variables.
Crime in the U.S.
The dataset crime.us from the VGAMdata package includes the absolute crime figures and the crime rates by population for the fifty U.S. states in 2009.
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
data(crime.us, package="VGAMdata")
crime.usR <- crime.us
names(crime.usR) <- gsub("*Rate", "", names(crime.usR))
names(crime.usR)[19:20] <- c("Larceny", "MotorVTheft");
ggpairs(crime.usR[, c(13:16, 18:20)],
title="Crime rates in the USA",
diag=list(continuous='densityDiag'), axisLabels='none')

The figure above shows a splom of the rates for seven kinds of crime. The dataset also includes rates for the four crimes of violence together and density estimates for the variables down the diagonal.
Functions for drawing sploms
| plot |
|
| pairs |
|
| spm |
car |
| cpairs |
gclus |
| splom |
lattice |
| ggpairs |
GGally |
| gpairs |
gpairs |
| pairs.mod |
SMPracticals |
| pairscor.fnc |
languageR |
| … |
… |
Note: Sploms are ineffective for large number of variables (a splom for m variables includes m(m-1) scatterplots and m diagonal elements).
2.7. Scatterplot options
Point size
- larger (for emphasising outliers);
- smaller (for distinguishing groups of points);
Symbols for points
- if the points do not overlap and are easy to identify, we should draw the members of each group with a different symbol to tell them apart;
- with large datasets, we should use a set of scatterplots for each group.
Alpha-blending
- each point is given a weight between 0 and 1;
- where several points overlap, the resulting area is drawn correspondingly darker (or more opaque).
Colouring points
- used to distinguish points by groups;
- if there are points on top of one another, the visible colour is the colour of the last point drawn.
2.8. Modelling and testing for relationships between variables
Correlation
Regression
Smoothing
Bivariate density estimation
Outliers
2.9. Main points
- Scatterplots can take many different forms and provide a lot of information about the relationship between two variables.
- Adding lines or smooths to scatterplots is easy and often provides valuable guidance, as can be seen in the plot from 2.4, where the relationship between sons’ and fathers’ heights is difficult to assess from the scatterplot alone.
- Trellis displays are very effective for comparing scatterplots by subgroup, especially when the groups overlap, like in the case of the Olympic games scatterplot.
- Sploms are excellent for giving a quick overview of a few variables. For example, the plot from 5.6 summarises the U.S. crime rates dataset well.
3. Investigating Multivariate Continuous Data
3.1. Introduction
Traditional methods used for displaying multivariate continous data:
- scatterplots. Disadvantage: only for bivariate data, not effective for exploring in higher dimensions;
- matrices of scatterplots. Disadvantage: do not really convey what multivariate structure there might be in the data;
- dimension reduction methods (PCA, MDS, etc). Disadvantage: they approximate the data, and it is difficult to assess how good the approximations are.
New Method: parallel coordinate plots.
This solution has become popular for multivariate continuous data in recent years.
3.2. What is a parallel coordinate plot (pcp)?
library('GGally'); data(food, package="MMST")
names(food) <- c("Fat", "Food.energy", "Carbohyd", "Protein", "Cholest", "Wt", "Satur.Fat")
food1 <- food/food$Wt
ggparcoord(data = food1, columns=c(1:5, 7), scale="uniminmax", alphaLines=0.2) +
xlab("") + ylab("")

In a parallel coordinate plot:
- all axes are parallel to one another;
- each variable has its own individual vertical axis (or alternatively all the axes are horizontal);
- each axis is usually scaled from the minimum to the maximum case values for the variable;
- each polygonal line across all axes defines a profile.
Functions for drawing pcp’s in R
| ggparcoord |
GGally |
| parcoord |
MASS |
| cparcoord |
gclus |
| pcp |
PairViz |
| parallelplot |
lattice |
| parallel.ade |
epade |
| ipcp |
iplots |
| … |
… |
3.3. Features you can see with parallel coordinate plots
Parallel coordinate plots are very useful for checking results like:
- outliers
- groups of cases with very similar values (clusters)
- bivariate associations between adjacent variables (correlations)
- gaps or concentrations of data
- skewness of univariate distributions for several variables at once
Illustrating Fisher’s iris dataset
ggparcoord(iris, columns=1:4, groupColumn="Species")

3.4. Parallel coordinate plots and time series
In a parallel coordinate plot used for representing time series:
- each axis represents one of the time points;
- we cannot change the order of the axis;
- a common scaling is used for all axes (scale=globalminmax).
Note: ipcp method of iplots package is better for creating interactive time series pcp’s.
Illustrating the acres of corn planted by US states from 1866 to 2011
library(reshape2); data(nass.corn, package="agridat")
c1 <- melt(nass.corn, id=c("year", "state"))
c1 <- within(c1, StateV <- interaction(state, variable))
c2 <- dcast(c1, StateV~year)
ggparcoord(subset(c2[1:48,], c2[1:48,147]> 250000), columns=2:147, groupColumn="StateV", scale="globalminmax") +
xlab("Year") + ylab("Acres") +
scale_x_discrete(breaks=seq(1865, 2015, 10)) +
theme(legend.position = "none")

3.5. Parallel coordinate plots for indices
Usually, index values are weighted combinations of the values of their components. Examples:
- stock market indices - used to represent performances of groups of shares;
- consumer price index - summarises prices across a wide range of products;
- average teaching score - an overall index used for ranking universities in UK.
One way to represent them in pcp’s:
- the index itself will have one axes associated;
- for each of the components there will be a seperate axes;
- the polygonal lines are the profiles of the individuals;
- more effective if some individuals or groups are identified by colour.
Illustrating the Russell group from the uniranks dataset
data(uniranks, package="GDAdata")
names(uniranks)[c(5, 6, 8, 10, 11, 13)] <- c("AvTeach", "NSSTeach", "SpendperSt", "Careers", "VAddScore", "NSSFeedb")
uniranks1 <- within(uniranks, StaffStu <- 1/(StudentStaffRatio))
uniranks2 <- within(uniranks1,
Rus <- ifelse(UniGroup=="Russell", "Russell", "not"))
ggparcoord(uniranks2[order(uniranks2$Rus, decreasing=FALSE),],
columns=c(5:8, 10:14),
order=c(5,12,8,9,14,6,13,7,11,10),
groupColumn="Rus", scale="uniminmax") +
xlab("") + ylab("") +
theme(legend.position = "none",
axis.ticks.y = element_blank(),
axis.text.y = element_blank()) +
scale_colour_manual(values = c("gray","red")) +
geom_line(size = 0.8)

3.6. Options for parallel coordinate plots
3.6.1. Scaling
Depending on the situation, there are different types of scaling:
- standardise each variable by substracting the mean and divinding by the standard deviation (scale=std)
- use the respective minimum and maximum values for each variable (scale=uniminmax)
- use a common scale so that the data will not be altered (scale=globalminmax)
- standardize vertical height and then center each varialbe around the median (scale=center, scaleSummary=median)
library('gridExtra'); data(body, package="gclus")
body1 <- body
names(body1) <- abbreviate(names(body), 2)
names(body1)[c(4:5, 11:13, 19:21)] <- c("CDp", "CD", "Ch", "Ws", "Ab", "Cl", "An", "Wr")
a1 <- ggparcoord(body1, columns=1:24, alphaLines=0.1) + xlab("") + ylab("")
a2 <- ggparcoord(body1, columns=1:24, scale="uniminmax", alphaLines=0.1) + xlab("") + ylab("")
a3 <- ggparcoord(body1, columns=1:24, scale="globalminmax", alphaLines=0.1) + xlab("") + ylab("")
a4 <- ggparcoord(body1, columns=1:24, scale="center", scaleSummary="median", alphaLines=0.1) + xlab("") + ylab("")
grid.arrange(a1, a2, a3, a4)

3.6.2. Outliers
Sometimes we want to redraw the plot without showing the outliers. There are 3 ways:
- remove the cases with any outliers
- trim all outlier values to the chosen limits
- restrict the plot to the chosen limits
Remove the cases with outliers (food dataset)
fc <- function(xv) {
bu <- boxplot(xv, plot=FALSE)$stats[5]
cxv <- ifelse(xv > bu, NA, xv)
bl <- boxplot(xv, plot=FALSE)$stats[1]
cxv <- ifelse(cxv < bl, NA, cxv)
}
data(food, package="MMST")
rxfood <- as.data.frame(apply(food,2,fc))
ggparcoord(data = rxfood, columns = c(1:7),
scale="uniminmax", missing="exclude",
alphaLines=0.3) + xlab("") + ylab("")

Trim all outliers to the chosen limits (food dataset)
fb <- function(xv) {
bu <- boxplot(xv, plot=FALSE)$stats[5]
rxv <- ifelse(xv > bu, bu, xv)
bl <- boxplot(xv, plot=FALSE)$stats[1]
rxv <- ifelse(rxv < bl, bl, rxv)
}
data(food, package="MMST")
rfood <- as.data.frame(apply(food,2,fb))
ggparcoord(data = rfood, columns = c(1:7),
scale="uniminmax", alphaLines=0.3)

Restrict the plot to the chosen limits (food dataset)
fd <- function(xv) {
bu <- boxplot(xv, plot=FALSE)$stats[5]
bl <- boxplot(xv, plot=FALSE)$stats[1]
dxv <- (xv - bl)/(bu - bl)
}
data(food, package="MMST")
rofood <- as.data.frame(apply(food,2,fd))
ggparcoord(data = rofood, columns = c(1:7)) +
coord_cartesian(ylim=c(0,1))

3.7. Main points
Parallel coordinate plots are a powerful multivariate display, showing all continuous variables at once;
There are many pcp display options in addition to the usual formatting options applicable to other graphics;
Pcp’s are helpful for exploring and evaluating the results of analyses such as discriminant or cluster analyses. They are also useful for checking multivariate outliers;
Pcp’s are an alternative means for presenting multiple regularly spaced time series;
Pcp’s are informative for displaying indices and their component parts together
We should draw several pcp’s to uncover the different features in a dataset (e.g. food dataset)
THANK YOU!